AMalik03_FinalHomeworkCode_03

Some of my best friends are Zombies…


Original Code Reflection

I enjoyed most of this homework, but some of it was a struggle. I really enjoyed trying to figure out how to make my own functions despite how frustrating it was. I struggled with question 6 because I misread the question and was trying to plot at 100 X 30 data points into a vector of vectors but realized I only needed the means of each random sample. I also struggled with finding a way to add to vectors. I used the internet and found the function “append” to make changes to a vector. Finally I needed to refresh my memory on how to analyze qq plots because I didn’t know how to use the graph. I reread Module 8 to grab a better understanding of quantile-quantile plots and its uses.I didn’t use the full capabilities of ggplot2 because I wanted to make sure I did the homework correctly. I plan on adding color and prettiness for the final submission.

Final Code Reflection

Ritika’s comments and review of my code really helped me improve and find errors in my homework assignment. I realized through her comments and my peer commentary on Jess’s code that I didn’t do Question 5 right. Additionally, in question 3, when a scatter plot between weight in relation to age, Ritika believes there is a weak positive correlation. I personally think there is no relation but I used the corr.test function from Module 12 to find the correlation coefficient and changed my answer because there is indeed a weak correlation. Finally, based on Jess’s code, I realized left and right skewed is not a distribution, it’s a description of the graph, so I had to change my explanations for question 4. I also focused my final code on clarifying my explanations as well as making all my plots pretty.

1. Calculate the population mean and standard deviation for each quantitative random variable (height, weight, age, number of zombies killed, and years of education).

library(curl)
## Using libcurl 7.79.1 with LibreSSL/3.3.6
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::parse_date() masks curl::parse_date()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
f <-curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/zombies.csv")
d <-read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d) #make sure it's set up right
##   id first_name last_name gender   height   weight zombies_killed
## 1  1      Sarah    Little Female 62.88951 132.0872              2
## 2  2       Mark    Duncan   Male 67.80277 146.3753              5
## 3  3    Brandon     Perez   Male 72.12908 152.9370              1
## 4  4      Roger   Coleman   Male 66.78484 129.7418              5
## 5  5      Tammy    Powell Female 64.71832 132.4265              4
## 6  6    Anthony     Green   Male 71.24326 152.5246              1
##   years_of_education                           major      age
## 1                  1                medicine/nursing 17.64275
## 2                  3 criminal justice administration 22.58951
## 3                  1                       education 21.91276
## 4                  6                  energy studies 18.19058
## 5                  3                       logistics 21.10399
## 6                  4                  energy studies 21.48355
(mean.height <- mean(d$height))
## [1] 67.6301
(mean.weight <- mean(d$weight))
## [1] 143.9075
(mean.age <- mean(d$weight))
## [1] 143.9075
(mean.kills <- mean(d$zombies_killed))
## [1] 2.992
(mean.edu <- mean(d$years_of_education))
## [1] 2.996
pop.sd <- function(x){ #create a function for population SD
  sqrt(
    sum(
      (x - mean(x))^2
    )
    / length(x)
  )
}
(height.sd <- pop.sd(d$height))
## [1] 4.30797
(weight.sd <- pop.sd(d$weight))
## [1] 18.39186
(age.sd <- pop.sd(d$age))
## [1] 2.963583
(kills.sd <- pop.sd(d$zombies_killed))
## [1] 1.747551
(edu.sd <- pop.sd(d$years_of_education))
## [1] 1.675704

2. Use {ggplot} to make boxplots of each of these variables by gender.

heightplot <- ggplot(data = d, aes(x = gender, y = height, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Height(inches)") + labs(title = "Boxplot of Height by Gender") + theme(plot.title = element_text(hjust =0.5)) 
heightplot

#Followed Jess's code to center the titles. THANKS JESS!
weightplot <- ggplot(data = d, aes(x = gender, y = weight, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Weight(pounds)") + labs(title = "Boxplot of Weight by Gender") + theme(plot.title = element_text(hjust =0.5)) 
weightplot

ageplot <- ggplot(data = d, aes(x = gender, y = age, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Age(years)") + labs(title = "Boxplot of Weight by Gender") + theme(plot.title = element_text(hjust =0.5)) 
ageplot

killsplot <- ggplot(data = d, aes(x = gender, y = zombies_killed, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Zombies Killed") + labs(title = "Boxplot of Weight by Gender") + theme(plot.title = element_text(hjust =0.5)) 
killsplot

eduplot <- ggplot(data = d, aes(x = gender, y = years_of_education, fill = gender)) + geom_boxplot() + xlab("Gender") + ylab("Years of Education") + labs(title = "Boxplot of Weight by Gender") + theme(plot.title = element_text(hjust =0.5)) 
eduplot

## 3. Use {ggplot} to make scatterplots of height and weight in relation to age. Do these variables seem to be related? In what way?

scatplotheight <- ggplot(data = d, aes(x = age, y = height)) + geom_point(colour = "brown4") + xlab("Age(years)") + ylab("Height(in)") + labs(title = "Relationship Between Age and Height") + theme(plot.title = element_text(hjust =0.5)) 
scatplotheight

(cor.test(d$age, d$height))
## 
##  Pearson's product-moment correlation
## 
## data:  d$age and d$height
## t = 26.905, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6109440 0.6829231
## sample estimates:
##       cor 
## 0.6483801


According to the scatterplot, there is a moderate positive correlation between the ages and the height of the survivors because the Pearson correlation coefficient is calculates to be 0.648.

scatplotweight <- ggplot(data = d, aes(x = age, y = weight)) + geom_point(colour = "maroon") + xlab("Age(years)") + ylab("Weight(lb)") + labs(title = "Relationship Between Age and Weight") + theme(plot.title = element_text(hjust =0.5)) 
scatplotweight

(cor.test(d$age, d$weight))
## 
##  Pearson's product-moment correlation
## 
## data:  d$age and d$weight
## t = 10.682, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2635609 0.3748694
## sample estimates:
##       cor 
## 0.3203203


According to the scatter plot, there is a weak positive relationship between the age of survivors and their weight because the Pearson correlation coefficient is calculated to be 0.320.

4. Using histograms and Q-Q plots, check whether the quantitative variables seem to be drawn from a normal distribution. Which seem to be and which do not (hint: not all are drawn from the normal distribution)? For those that are not normal, can you determine from which common distribution they are drawn?

plot.hist.qq <- function(Variable){ #Made function to plot histogram and qqplot to make plotting easier
  par(mfrow = c(1,2))
  plot(hist(Variable, plot = FALSE), col = "maroon", main = c("Histogram of Variable")) #Plots a basic histogram of variable y
  qqnorm(Variable, frame = FALSE, col = "brown") #Plots a qq graph of data points
  qqline(Variable) #plots a qq line 
}
plot.hist.qq(d$height)


Yes, based on the histogram and the qqplot, the height of the survivors are drawn from a normal distribution curve.

plot.hist.qq(d$weight)


Yes, based on the histogram and the qqplot, the weight of the survivors are drawn from a normal distribution curve.

plot.hist.qq(d$age)


Yes, based on the histogram and the qqplot, the age of the survivors are drawn from a normal distribution curve.

plot.hist.qq(d$zombies_killed)


No, based on the graphs, the data does not stem from a normal distributions. The graph can be described as right-skewed. I predict that the data is pulled from a Poisson distribution.

plot.hist.qq(d$years_of_education)


No, the data for survivors’ years of education is not drawn from a normal distribution. The graph can be described as right-skewed. I predict that the data is pulled from a Poisson distribution.

5. Now use the sample() function to sample ONE subset of 30 zombie survivors (without replacement) from this population and calculate the mean and sample standard deviation for each variable. Also estimate the standard error for each variable, and construct the 95% confidence interval for each mean. Note that for the variables that are not drawn from the normal distribution, you may need to base your estimate of the CIs on slightly different code than for the normal.

(samplesized <- sample_n(d, size = 30, replace = FALSE))
##     id first_name last_name gender   height   weight zombies_killed
## 1  298     Howard    Martin   Male 68.85221 150.7262              6
## 2  730      Emily   Mendoza Female 65.49701 133.6932              4
## 3  177       Juan    Barnes   Male 62.33876 133.5535              3
## 4  439   Kimberly    Graham Female 66.38030 130.4227              2
## 5  264      Craig  Stephens   Male 70.27466 137.8759              3
## 6  304     Robert    Parker   Male 73.60436 161.1474              3
## 7  963   Margaret      Rice Female 66.41046 143.8121              2
## 8  315      Tammy    Martin Female 65.69532 123.4174              5
## 9  562     Sharon      Carr Female 70.76667 169.6664              4
## 10 165       Jane    Graham Female 69.06732 140.7583              5
## 11 793        Amy    Taylor Female 64.40630 113.9125              3
## 12  10    Russell    Chavez   Male 72.81769 168.5616              2
## 13 601       Judy   Stewart Female 62.69921 120.4426              4
## 14 980    Melissa      Gray Female 64.19924 135.2534              3
## 15 217       Fred    Butler   Male 69.85855 149.5380              0
## 16 300      Helen      Cook Female 71.68438 157.6903              4
## 17 671       Lisa    Harper Female 65.79459 134.8015              2
## 18 708    Gregory      Gray   Male 70.10855 145.2425              3
## 19 898       Sean    Romero   Male 69.87416 169.4200              5
## 20 294      Laura      Ross Female 63.18053 127.1017              1
## 21 312      Paula      Hall Female 60.34227 114.5168              2
## 22 879   Jennifer   Daniels Female 68.95607 152.1361              6
## 23 281     Philip Carpenter   Male 72.58839 147.1452              4
## 24 141        Ann      Dunn Female 65.36311 143.3415              3
## 25   3    Brandon     Perez   Male 72.12908 152.9370              1
## 26 394       Adam  Gonzales   Male 79.48964 162.3697              1
## 27 766     Carlos      Hart   Male 69.70424 162.2236              2
## 28 764      Terry     Myers   Male 65.15692 137.0728              6
## 29 659       Jack  Castillo   Male 71.94457 158.1921              3
## 30 288       Juan    Murray   Male 73.70542 163.5918              6
##    years_of_education                           major      age
## 1                   4                   city planning 19.26587
## 2                   2 criminal justice administration 22.05847
## 3                   1          mechanical engineering 14.61176
## 4                   5                      philosophy 23.76154
## 5                   3                animal husbandry 21.62270
## 6                   3                       economics 23.85202
## 7                   3                       economics 20.89564
## 8                   4                       economics 20.38294
## 9                   1                      psychology 21.12729
## 10                  2                       economics 21.46047
## 11                  6                      philosophy 18.06166
## 12                  3               military strategy 21.50798
## 13                  3                       economics 16.60652
## 14                  3           agricultural sciences 21.61279
## 15                  7                   communication 18.18100
## 16                  5                      philosophy 26.25881
## 17                  3               military strategy 18.73178
## 18                  5                animal husbandry 22.62593
## 19                  2               culinart services 18.99324
## 20                  5           environmental science 24.37250
## 21                  3                animal husbandry 17.17613
## 22                  2                       education 22.93477
## 23                  3         business administration 22.37798
## 24                  2              physical education 15.65899
## 25                  1                       education 21.91276
## 26                  4              physical education 26.11363
## 27                  2                  human services 17.81768
## 28                  2                    epidemiology 14.99672
## 29                  1         business administration 22.77554
## 30                  4                medicine/nursing 20.28010
(mean.height.s <- mean(samplesized$height))
## [1] 68.42967
(mean.weight.s <- mean(samplesized$weight))
## [1] 144.6855
(mean.age.s <- mean(samplesized$age))
## [1] 20.60117
(mean.kills.s <- mean(samplesized$zombies_killed))
## [1] 3.266667
(mean.edu.s <- mean(samplesized$years_of_education))
## [1] 3.133333
(sd.height.s <- pop.sd(samplesized$height))
## [1] 4.150541
(sd.weight.s <- pop.sd(samplesized$weight))
## [1] 15.87423
(sd.age.s <- pop.sd(samplesized$age))
## [1] 2.99947
(sd.kills.s <- pop.sd(samplesized$zombies_killed))
## [1] 1.611073
(sd.edu.s <- pop.sd(samplesized$years_of_education))
## [1] 1.49963
standarderror <- function(sd) {
  sd/sqrt(30)
}

(semheight <- standarderror(sd.height.s))
## [1] 0.7577816
(semweight <-standarderror(sd.weight.s))
## [1] 2.898225
(semage <-standarderror(sd.age.s))
## [1] 0.5476258
(semkills <-standarderror(sd.kills.s))
## [1] 0.2941403
(semedu <-standarderror(sd.edu.s))
## [1] 0.2737937
xbar = NULL
sd= NULL
normbounds <- function(xbar, sem){
  lower.bound <- xbar - qnorm(1-0.05/2) * sem
  upper.bound <- xbar + qnorm(1-0.05/2) * sem
  print(c(lower.bound, upper.bound))
}
  
normbounds(mean.height.s, sd.height.s)
## [1] 60.29476 76.56458
normbounds(mean.weight.s, sd.weight.s)
## [1] 113.5725 175.7984
normbounds(mean.age.s, sd.age.s)
## [1] 14.72232 26.48003
#Made bounds function using Module 9
poisbounds <- function(x){
  poisson.test(length(x))
}
poisbounds(mean.kills.s)
## 
##  Exact Poisson test
## 
## data:  length(x) time base: 1
## number of events = 1, time base = 1, p-value = 1
## alternative hypothesis: true event rate is not equal to 1
## 95 percent confidence interval:
##  0.02531781 5.57164339
## sample estimates:
## event rate 
##          1
poisbounds(mean.edu.s)
## 
##  Exact Poisson test
## 
## data:  length(x) time base: 1
## number of events = 1, time base = 1, p-value = 1
## alternative hypothesis: true event rate is not equal to 1
## 95 percent confidence interval:
##  0.02531781 5.57164339
## sample estimates:
## event rate 
##          1

6. Now draw 99 more random samples of 30 zombie apocalypse survivors, and calculate the mean for each variable for each of these samples. Together with the first sample you drew, you now have a set of 100 means for each variable (each based on 30 observations), which constitutes a sampling distribution for each variable. What are the means and standard deviations of this distribution of means for each variable? How do the standard deviations of means compare to the standard errors estimated in [5]? What do these sampling distributions look like (a graph might help here)? Are they normally distributed? What about for those variables that you concluded were not originally drawn from a normal distribution?

meanheights <- c(mean.height.s)
meanweights <- c(mean.weight.s)
meanages <- c(mean.age.s)
meankills <- c(mean.kills.s)
meanedus <- c(mean.edu.s)



for (i in 1:99) { #create a for loop that adds the mean vectors of the quantitative daae to a vector of means, a total of 99 loops for 100 samples.
  rsample <- sample_n(d, size = 30, replace = FALSE)
  meanheights <- append(meanheights, mean(rsample$height))
  meanweights <- append(meanweights, mean(rsample$weight))
   meanages <- append(meanages, mean(rsample$age))
   meankills <- append(meankills, mean(rsample$zombies_killed))
   meanedus <- append(meanedus, mean(rsample$years_of_education))
}

mean(meanheights)
## [1] 67.55221
pop.sd(meanheights) #Use pop.sd because this set of 30 is our sample
## [1] 0.7882015
mean(meanweights)
## [1] 143.484
pop.sd(meanweights)
## [1] 3.166218
mean(meanages)
## [1] 20.05964
pop.sd(meanages)
## [1] 0.5480569
mean(meankills)
## [1] 2.980667
pop.sd(meankills)
## [1] 0.3160935
mean(meanedus)
## [1] 2.977


The standard deviations are smaller from this population are smaller than that of the one from Question 5. Also, he standard deviation of the means are very close to the standard error calculated

par(mfrow = c(1,2))
plot.hist.qq(meanheights)

plot.hist.qq(meanweights)

plot.hist.qq(meanages)

plot.hist.qq(meankills)

plot.hist.qq(meanedus)


Based on the histograms and qq plots, all the quantitative data seem to be drawn from a normal distribution.